WBC-ALC: A Weak Blocking Coordinated Application-Level Checkpointing for MPI Programs


Related resources

Blocking and non-blocking coordinated checkpointing for large scale MPI computation

Nowadays, clusters and grids are made of more and more computing nodes. Multi-process applications are most often programmed through message passing. As the number of processes increases, these applications need a fault-tolerant message passing library. In this paper, we present two implementations of fault tolerant protocols based on MPICH, a blocki...

Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI Protocols

A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPIs has led to the development of several fault tolerant MPI environments. Different approaches a...

C3: A System for Automating Application-Level Checkpointing of MPI Programs

Fault-tolerance is becoming necessary on high-performance platforms. Checkpointing techniques make programs fault-tolerant by saving their state periodically and restoring this state after failure. System-level checkpointing saves the state of the entire machine on stable storage, but this usually has too much overhead. In practice, programmers do manual application-level checkpointing by writi...

Application-level Checkpointing for OpenMP Programs

It is becoming important for long-running scientific applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically to disk, and when a failure occurs, the computation is restarted from the last saved state. One common way of doing this, called System-level Checkpointing (SLC), requires modifying the Op...

Application-Level Checkpointing Techniques for Parallel Programs

In its simplest form, checkpointing is the act of saving a program’s computation state in a form external to the running program, e.g. to a filesystem. The checkpoint files can then be used to resume computation upon failure of the original process(es), hopefully with minimal loss of computing work. A checkpoint can be taken using a variety of techniques in every l...

Journal

Journal title: IEICE Transactions on Information and Systems

Year: 2012

ISSN: 0916-8532,1745-1361

DOI: 10.1587/transinf.e95.d.786